Packet prioritization and associated bandwidth and buffer management techniques for audio over IP

ABSTRACT

The present invention is directed to voice communication devices in which an audio stream is divided into a sequence of individual packets, each of which is routed via pathways that can vary depending on the availability of network resources. All embodiments of the invention rely on an acoustic prioritization agent that assigns a priority value to the packets. The priority value is based on factors such as whether the packet contains voice activity and the degree of acoustic similarity between this packet and adjacent packets in the sequence. A confidence level, associated with the priority value, may also be assigned. In one embodiment, network congestion is reduced by deliberately failing to transmit packets that are judged to be acoustically similar to adjacent packets; the expectation is that, under these circumstances, traditional packet loss concealment algorithms in the receiving device will construct an acceptably accurate replica of the missing packet. In another embodiment, the receiving device can reduce the number of packets stored in its jitter buffer, and therefore the latency of the speech signal, by selectively deleting one or more packets within sustained silences or non-varying speech events. In both embodiments, the ability of the system to drop appropriate packets may be enhanced by taking into account the confidence levels associated with the priority assessments.

FIELD OF THE INVENTION

[0001] The present invention relates generally to audio communications over distributed processing networks and specifically to voice communications over data networks.

BACKGROUND OF THE INVENTION

[0002] Convergence of the telephone network and the Internet is driving the move to packet-based transmission for telecommunication networks. As will be appreciated, a “packet” is a group of consecutive bytes (e.g., a datagram in TCP/IP) sent from one computer to another over a network. In Internet Protocol or IP telephony or Voice Over IP (VoIP), a telephone call is sent via a series of data packets on a fully digital communication channel. This is effected by digitizing the voice stream, encoding the digitized stream with a codec, and dividing the digitized stream into a series of packets (typically in 20 millisecond increments). Each packet includes a header, trailer, and data payload of one to several frames of encoded speech. Integration of voice and data onto a single network offers significantly improved bandwidth efficiency for both private and public network operators.

[0003] In voice communications, high end-to-end voice quality in packet transmission depends principally on the speech codec used, the end-to-end delay across the network and variation in the delay (jitter), and packet loss across the channel. To prevent excessive voice quality degradation from transcoding, it is necessary to control whether and where transcodings occur and what combinations of codecs are used. End-to-end delays on the order of milliseconds can have a dramatic impact on voice quality. When end-to-end delay exceeds about 150 to 200 milliseconds one way, voice quality is noticeably impaired. Voice packets can take an endless number of routes to a given destination and can arrive at different times, with some arriving too late for use by the receiver. Some packets can be discarded by computational components such as routers in the network due to network congestion. When an audio packet is lost, one or more frames are lost too, with a concomitant loss in voice quality.

[0004] Conventional VoIP architectures have developed techniques to resolve network congestion and relieve the above issues. In one technique, voice activity detection (VAD) or silence suppression is employed to detect the absence of audio (or detect the presence of audio) and conserve bandwidth by preventing the transmission of “silent” packets over the network. Most conversations include about 50% silence. When only silence is detected for a specified amount of time, VAD informs the Packet Voice Protocol and prevents the encoder output from being transported across the network. VAD is, however, unreliable and the sensitivity of many VAD algorithms imperfect. To exacerbate these problems, VAD has only a binary output (namely silence or no silence) and in borderline cases must decide whether to drop or send the packet. When the “silence” threshold is set too low, VAD is rendered meaningless and, when too high, audio information can be erroneously classified as “silence” and lost to the listener. The loss of audio information can cause the audio to be choppy or clipped. In another technique, a receive buffer is maintained at the receiving node to provide additional time for late and out-of-order packets to arrive. Typically, the buffer has a capacity of around 150 milliseconds. Most but not all packets will arrive before the time slot for the packet to be played is reached. The receive buffer can be filled to capacity, at which point packets may be dropped. In extreme cases, substantial, consecutive parts of the audio stream are lost due to the limited capacity of the receive buffer, leading to severe reductions in voice quality. Although packet loss concealment algorithms at the receiver can reconstruct missing packets, packet reconstruction is based on the contents of one or more temporally adjacent packets which can be acoustically dissimilar to the missing packet(s), particularly when several consecutive packets are lost, and therefore the reconstructed packet(s) can have very little relation to the contents of the missing packet(s).

SUMMARY OF THE INVENTION

[0005] These and other needs are addressed by the various embodiments and configurations of the present invention. The present invention is directed generally to a computational architecture for efficient management of transmission bandwidth and/or receive buffer latency.

[0006] In one embodiment of the present invention, a transmitter for a voice stream is provided that comprises:

[0007] (a) a packet protocol interface operable to convert one or more selected segments (e.g., frames) of the voice stream into a packet; and

[0008] (b) an acoustic prioritization agent operable to control processing of the selected segment and/or packet based on one or more of (i) a level of confidence that the contents of the selected segment are not the product of voice activity (e.g., are silence), (ii) a type of voice activity (e.g., plosive) associated with or contained in the contents of the selected segment, and (iii) a degree of acoustic similarity between the selected segment and another segment of the voice stream.

[0009] The level of confidence permits the voice activity detector to provide a ternary output as opposed to the conventional binary output. The prioritization agent can use the level of confidence in the ternary output, possibly coupled with one or more measures of the traffic patterns on the network, to determine dynamically whether or not to send the “silent” packet and, if so, use a lower transmission priority or class for the packet.

[0010] The type of voice activity permits the prioritization agent to identify extremely important parts of the voice stream and assign a higher transmission priority and/or class to the packet(s) containing these parts of the voice stream. The use of a higher transmission priority and/or class can significantly reduce the likelihood that the packet(s) will arrive late, out of order, or not at all.

[0011] The comparison of temporally adjacent packets to yield a degree of acoustic similarity permits the prioritization agent to control bandwidth effectively. The agent can use the degree of similarity, possibly coupled with one or more measures of the traffic patterns on the network, to determine dynamically whether or not to send a “similar” packet and, if so, use a lower transmission priority or class for the packet. Packet loss concealment algorithms at the receiver can be used to reconstruct the omitted packet(s) to form a voiced signal that closely matches the original signal waveform. Compared to conventional transmission devices, fewer packets can be sent over the network to realize an acceptable signal waveform.

[0012] In another embodiment of the present invention, a receiver for a voice stream is provided that comprises:

[0013] (a) a receive buffer containing a plurality of packets associated with voice communications; and

[0014] (b) a buffer manager operable to remove some of the packets from the receive buffer while leaving other packets in the receive buffer based on a level of importance associated with the packets.

[0015] In one configuration, the level of importance of each of the packets is indicated by a corresponding value marker. The level of importance or value marker can be based on any suitable criteria, including a level of confidence that contents of the packet contain voice activity, a degree of similarity of temporally adjacent packets, the significance of the audio in the packet to receiver understanding or fidelity, and combinations thereof.

[0016] In another configuration, the buffer manager performs time compression around the removed packet(s) to prevent reconstruction of the packets by the packet loss concealment algorithm. This can be performed by, for example, resetting a packet counter indicating an ordering of the packets, such as by assigning the packet counter of the removed packet to a packet remaining in the receive buffer.

[0017] In another configuration, the buffer manager only removes packet(s) from the buffer when the buffer delay or capacity equals or exceeds a predetermined level. When the buffer is not in an overcapacity situation, it is undesirable to degrade the quality of voice communications, even if only slightly.

[0018] The various embodiments of the present invention can provide a number of advantages. First, the present invention can substantially decrease network congestion by dropping unnecessary packets, thereby providing lower end-to-end delays across the network, lower degrees of variation in the delay (jitter), and lower levels of packet loss across the channel. Second, the various embodiments of the present invention can handle effectively the bursty traffic and best-effort delivery problems commonly encountered in conventional networks while maintaining consistently and reliably high levels of voice quality. Third, voice quality can be improved relative to conventional voice activity detectors by not discarding “silent” packets in borderline cases.

[0019] These and other advantages will be apparent from the disclosure of the invention(s) contained herein.

[0020] The above-described embodiments and configurations are neither complete nor exhaustive. As will be appreciated, other embodiments of the invention are possible utilizing, alone or in combination, one or more of the features set forth above or described in detail below.

BRIEF DESCRIPTION OF THE DRAWINGS

[0021] FIG. 1 is a block diagram of a simple network for a VoIP session between two endpoints according to a first embodiment of the present invention;

[0022] FIG. 2 is a block diagram of the functional components of a transmitting voice communication device according to the first embodiment;

[0023] FIG. 3 is a block diagram of the functional components of a receiving voice communication device according to the first embodiment;

[0024] FIG. 4 is a flow chart of a voice activity detector according to a second embodiment of the present invention;

[0025] FIG. 5 is a flow chart of a codec according to a third embodiment of the present invention;

[0026] FIG. 6 is a flow chart of a packet prioritizing algorithm according to the second embodiment of the present invention;

[0027] FIG. 7 is a block diagram illustrating time compression according to a fourth embodiment of the present invention; and

[0028] FIG. 8 is a flow chart of a buffer management algorithm according to the fourth embodiment of the present invention.

DETAILED DESCRIPTION

[0029] FIG. 1 is a simplistic VoIP network architecture according to a first embodiment of the present invention. First and second voice communication devices 100 and 104 transmit and receive VoIP packets. The packets can be transmitted over one of two paths. The first and shorter path is via networks 108 and 112 and router 116. The second and longer path is via networks 108, 112, and 120 and routers 124 and 128. Depending upon the path followed, the packets can arrive at either of the communication devices at different times. As will be appreciated, network architectures suitable for the present invention can include any number of networks and routers and other intermediate nodes, such as transcoding gateways, servers, switches, base transceiver stations, base station controllers, modems, routers, and multiplexers, and can employ any suitable packet-switching protocols, whether using connection-oriented or connectionless services, including without limitation Internet Protocol or IP, Ethernet, and Asynchronous Transfer Mode or ATM.

[0030] As will be further appreciated, the first and second voice communication devices 100 and 104 can be any communication devices configured to transmit and/or receive packets over a data network, such as the Internet. For example, the voice communication devices 100 and 104 can be a personal computer, a laptop computer, a wired analog or digital telephone, a wireless analog or digital telephone, an intercom, or radio or video broadcast studio equipment.

[0031] FIG. 2 depicts an embodiment of a transmitting voice communication device. The device 200 includes, from left to right, a first user interface 204 for outputting signals inputted by the first user (not shown) and an outgoing voice stream 206 received from the first user, an analog-to-digital converter 208, a Pulse Code Modulation or PCM interface 212, an echo canceller 216, a Voice Activity Detector or VAD 220, a voice codec 224, a packet protocol interface 228, and an acoustic prioritizing agent 232.

[0032] The first user interface 204 is conventional and can be configured in many different forms depending upon the particular implementation. For example, the user interface 204 can be configured as an analog telephone or as a PC.

[0033] The analog-to-digital converter 208 converts, by known techniques, the analog outgoing voice stream 206 received from the first user interface 204 into an outgoing digital voice stream 210.

[0034] The PCM interface 212, inter alia, forwards the outgoing digital voice stream 210 to appropriate downstream processing modules for processing.

[0035] The echo canceller 216 performs echo cancellation on the digital stream 214, which is commonly a sampled, full-duplex voice port signal. Echo cancellation is preferably G.165 compliant.

[0036] The VAD 220 monitors packet structures in the incoming digital voice stream 218 received from the echo canceller 216 for voice activity. When no voice activity is detected for a configurable period of time, the VAD 220 informs the acoustic prioritizing agent 232 of the corresponding packet structure(s) in which no voice activity was detected and provides a level of confidence that the corresponding packet structure(s) contains no meaningful voice activity. This output is typically provided on a packet structure-by-packet structure basis. These operations of the VAD are discussed below with reference to FIG. 4.

[0037] VAD 220 can also measure the idle noise characteristics of the first user interface 204 and report this information to the packet protocol interface 228 in order to relay this information to the other voice communication device for comfort noise generation (discussed below) when no voice activity is detected.

[0038] The voice codec 224 encodes the voice data in the packet structures for transmission over the data network and compares the acoustic information (each frame of which includes spectral information such as sound or audio amplitude as a function of frequency) in temporally adjacent packet structures and assigns to each packet an indicator of the difference between the acoustic information in adjacent packet structures. These operations are discussed below with reference to FIG. 5. As shown in box 236, the voice codec typically includes, in memory, numerous voice codecs capable of different compression ratios. Although only codecs G.711, G.723.1, G.726, G.728, and G.729 are shown, it is to be understood that any voice codec, whether known currently or developed in the future, could be in memory. Voice codecs encode and/or compress the voice data in the packet structures. For example, a compression of 8:1 is achievable with the G.729 voice codec (thus the normal 64 Kbps PCM signal is transmitted in only 8 Kbps). The encoding functions of codecs are further described in Michaelis, Speech Digitization and Compression, in the International Encyclopedia of Ergonomics and Human Factors, edited by Warkowski, 2001; ITU-T Recommendation G.729 General Aspects of Digital Transmission Systems, Coding of Speech at 8 kbit/s using Conjugate-Structure Algebraic-Code-Excited Linear-Prediction, March 1996; and Mahfuz, Packet Loss Concealment for Voice Transmission Over IP Networks, September 2001, each of which is incorporated herein by this reference.
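By way of illustration only, the 8:1 figure can be related to payload size with a short calculation. The sketch below assumes the typical 20 millisecond packet interval mentioned in the background; the function name payload_bytes is hypothetical and not part of the disclosure.

```python
def payload_bytes(bitrate_bps: int, packet_ms: int) -> int:
    """Bytes of encoded speech carried by one packet of the given duration."""
    # bits per second * seconds per packet / 8 bits per byte
    return bitrate_bps * packet_ms // (8 * 1000)


pcm = payload_bytes(64_000, 20)   # 64 Kbps PCM (e.g., G.711)
g729 = payload_bytes(8_000, 20)   # 8 Kbps G.729
ratio = pcm // g729               # compression ratio between the two payloads

print(pcm, g729, ratio)
```

Under these assumptions a 20 millisecond PCM payload is 160 bytes and the corresponding G.729 payload is 20 bytes, which is the 8:1 ratio stated above. Header and trailer overhead is not included.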

[0039] The prioritization agent 232 efficiently manages the transmission bandwidth and the receive buffer latency. The prioritization agent (a) determines for each packet structure, based on the corresponding difference in acoustic information between the selected packet structure and a temporally adjacent packet structure (received from the codec), a relative importance of the acoustic information contained in the selected packet structure to maintaining an acceptable level of voice quality and/or (b) determines for each packet structure containing acoustic information classified by the VAD 220 as being “silent” a relative importance based on the level of confidence (output by the VAD for that packet structure) that the acoustic information corresponds to no voice activity. The acoustic prioritization agent, based on the differing levels of importance, causes the communication device to process differently the packets corresponding to the packet structures. The packet processing is discussed in detail below with reference to FIG. 6.

[0040] The packet protocol interface 228 assembles into packets and sequences the outgoing encoded voice stream and configures the packet headers for the various protocols and/or layers required for transmission to the second voice communication device 300 (FIG. 3). Typically, voice packetization protocols use a sequence number field in the transmit packet stream to maintain temporal integrity of voice during playout. Under this approach, the transmitter inserts a packet counter, such as the contents of a free-running, modulo-16 packet counter, into each transmitted packet, allowing the receiver to detect lost packets and properly reproduce silence intervals during playout at the receiving communication device. In one configuration, the importance assigned by the acoustic prioritizing agent can be used to configure the fields in the header to provide higher or lower transmission priorities. This option is discussed in detail below in connection with FIG. 6.
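As a minimal sketch of how a receiver might use such a counter, the following assumes packets are examined in arrival order and that fewer than sixteen consecutive packets are ever lost (a longer burst would be indistinguishable from a shorter one). The names next_counter and missing_between are hypothetical.

```python
MODULUS = 16  # free-running modulo-16 packet counter, per the text


def next_counter(counter: int) -> int:
    """Counter value the transmitter places in the following packet."""
    return (counter + 1) % MODULUS


def missing_between(prev: int, curr: int) -> int:
    """Packets missing between two consecutively received counter values."""
    # Modular subtraction handles the wrap from 15 back to 0.
    return (curr - prev - 1) % MODULUS


print(missing_between(3, 4))    # adjacent counters, nothing missing
print(missing_between(14, 1))   # counters 15 and 0 never arrived
```

A nonzero result is the break in the counter sequence that would prompt packet loss concealment at the receiving device.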

[0041] The packetization parameters, namely the packet size and the beginning and ending points of the packet, are communicated by the packet protocol interface 228 to the VAD 220 and codec 224 via the acoustic prioritization agent 232. The packet structure represents the portion of the voice stream that will be included within a corresponding packet's payload. In other words, a one-to-one correspondence exists between each packet structure and each packet. As will be appreciated, it is important that packetization parameter synchronization be maintained between these components to maintain the integrity of the output of the acoustic prioritization agent.

[0042] FIG. 3 depicts an embodiment of a receiving (or second) voice communication device 300. The device 300 includes, from right to left, the packet protocol interface 228 to remove the header information from the packet payload, the voice codec 224 for decoding and/or decompressing the received packet payloads to form an incoming digital voice stream 302, an adaptive playout unit 304 to process the received packet payloads, the echo canceller 216 for performing echo cancellation on the incoming digital voice stream 306, the PCM interface 212 for performing continuous phase resampling of the incoming digital voice stream 316 to avoid sample slips and forwarding the echo cancelled incoming voice stream 316 to a digital-to-analog converter 308 that converts the echo cancelled incoming voice stream 320 into an analog voice stream 324, and second user interface 312 for outputting to the second user the analog voice stream 324.

[0043] The adaptive playout unit 304 includes a packet loss concealment agent 328, a receive buffer 336, and a receive buffer manager 332. The adaptive playout unit 304 can further include a continuous-phase resampler (not shown) that removes timing frequency offset without causing packet slips or loss of data for voice or voiceband modem signals and a timing jitter measurement module (not shown) that allows adaptive control of FIFO delay.

[0044] The packet loss concealment agent 328 reconstructs missing packets based on the contents of temporally adjacent received packets. As will be appreciated, the packet loss concealment agent can perform packet reconstruction in a multiplicity of ways, such as replaying the last packet in place of the lost packet and generating synthetic speech using a circular history buffer to cover the missing packet. Preferred packet loss concealment algorithms preserve the spectral characteristics of the speaker's voice and maintain a smooth transition between the estimated signal and the surrounding original. In one configuration, packet loss concealment is performed by the codec.

[0045] The receive buffer 336 alleviates the effects of late packet arrival by buffering received voice packets. In most applications the receive buffer 336 is a First-In-First-Out or FIFO buffer that stores voice codewords before playout and removes timing jitter from the incoming packet sequence. As will be appreciated, the buffer 336 can dynamically increase and decrease in size as required to deal with late packets when the network is congested while avoiding unnecessary delays when the network is uncongested.

[0046] The buffer manager 332 efficiently manages the increase in latency (or end-to-end delay) introduced by the receive buffer 336 by dropping (low importance) enqueued packets as set forth in detail below in connection with FIGS. 7 and 8.

[0047] In addition to packet payload decoding and/or decompression, the voice codec 224 can also include a comfort noise generator (not shown) that, during periods of transmit silence when no packets are sent, generates a local noise signal that is presented to the listener. The generated noise attempts to match the true background noise. Without comfort noise, the listener can conclude that the line has gone dead.

[0048] Analog-to-digital and digital-to-analog converters 208 and 308, the pulse code modulation interface 212, the echo canceller 216 a and b, packet loss concealment agent 328, and receive buffer 336 are conventional.

[0049] Although FIGS. 2 and 3 depict voice communication devices in simplex configurations, it is to be understood that each of the voice communication devices 200 and 300 can act both as a transmitter and receiver in a duplexed configuration.

[0050] The operation of the VAD 220 will now be described with reference to FIGS. 2 and 4.

[0051] In the first step 400, the VAD 220 gets packet structure_(j) from the echo canceled digital voice stream 218. Packet structure counter j is initially set to one. In step 404, the VAD 220 analyzes the acoustic information in packet structure_(j) to identify by known techniques whether or not the acoustic information qualifies as “silence” or “no silence” and determine a level of confidence that the acoustic information does not contain meaningful or valuable acoustic information. The level of confidence can be determined by known statistical techniques, such as energy level measurement, least mean square adaptive filter (Widrow and Hoff 1959), and other Stochastic Gradient Algorithms. In one configuration, the acoustic threshold(s) used to categorize frames or packets as “silence” versus “nonsilence” vary dynamically, depending upon the traffic congestion of the network. The congestion of the network can be quantified by known techniques, such as by jitter determined by the timing measurement module (not shown) in the adaptive playout unit of the sending or receiving communication device, which would be forwarded to the VAD 220. Other selected parameters include latency or end-to-end delay, number of lost or dropped packets, number of packets received out-of-order, processing delay, propagation delay, and receive buffer delay/length. When the selected parameter(s) reach or fall below selected levels, the threshold can be reset to predetermined levels.
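The energy level measurement option in step 404 can be sketched as follows. This is only an illustrative example: the decibel values, the linear mapping from energy to confidence, and the function names are assumptions, and samples are assumed to be normalized to the range -1 to 1. The confidence returned is the confidence that the packet structure holds no meaningful voice activity.

```python
import math


def frame_energy_db(samples):
    """Mean-square energy of one packet structure, in dB relative to full scale."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(max(mean_square, 1e-12))  # floor avoids log(0)


def silence_confidence(energy_db, floor_db=-60.0, ceiling_db=-30.0):
    """Confidence in [0, 1] that the packet structure holds no voice activity.

    Assumed mapping: certain silence at or below floor_db, certain activity
    at or above ceiling_db, linear in between.
    """
    if energy_db <= floor_db:
        return 1.0
    if energy_db >= ceiling_db:
        return 0.0
    return (ceiling_db - energy_db) / (ceiling_db - floor_db)


def classify(samples, silence_threshold_db=-45.0):
    """Return the binary category together with the level of confidence."""
    energy = frame_energy_db(samples)
    label = "silence" if energy < silence_threshold_db else "nonsilence"
    return label, silence_confidence(energy)


print(classify([0.0] * 160))     # digital silence
print(classify([0.005] * 160))   # borderline, low-level signal
print(classify([0.5] * 160))     # clearly active signal
```

In a configuration with dynamically varying thresholds, silence_threshold_db would be supplied from a congestion measure rather than fixed.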

[0052] In step 408, the VAD 220 next determines whether packet structure_(j) is categorized as “silent” or “nonsilent”. When packet structure_(j) is categorized as being “silent”, the VAD 220, in step 412, notifies the acoustic prioritization agent 232 of the packet structure_(j) beginning and/or endpoint(s), packet length, the “silent” categorization of packet structure_(j), and the level of confidence associated with the “silent” categorization of packet structure_(j). When packet structure_(j) is categorized as “nonsilent” or after step 412, the VAD 220 in step 416 sets counter j equal to j+1 and in step 420 determines whether there is a next packet structure_(j). If so, VAD 220 returns to and repeats step 400. If not, VAD 220 terminates operation until a new series of packet structures is received.

[0053] The operation of the codec 224 will now be described with reference to FIGS. 2 and 5. In steps 500, 504 and 512, respectively, the codec 224 gets packet structure_(j), packet structure_(j−1), and packet structure_(j+1). Packet structure counter j is, of course, initially set to one.

[0054] In steps 508 and 516, respectively, the codec 224 compares packet structure_(j) with packet structure_(j−1), and packet structure_(j) with packet structure_(j+1). As will be appreciated, the comparison can be done by any suitable technique, whether currently known or developed in the future by those skilled in the art. For example, the amplitude and/or frequency waveforms (spectral information) formed by the collective frames in each packet can be mathematically compared and the difference(s) quantified by one or more selected measures or simply by a binary output such as “similar” or “dissimilar”. Acoustic comparison techniques are discussed in Michaelis, et al., A Human Factors Engineer's Introduction to Speech Synthesizers, in Directions in Human-Computer Interaction, edited by Badre, et al., 1982, which is incorporated herein by this reference. If a binary output is employed, the threshold selected for the distinction between “similar” and “dissimilar” can vary dynamically based on one or more selected measures or parameters of network congestion. Suitable measures or parameters include those set forth previously. When the measures increase or decrease to selected levels, the threshold is varied in a predetermined fashion.

[0055] In step 520, the codec 224 outputs the packet structure similarities/nonsimilarities determined in steps 508 and 516 to the acoustic prioritization agent 232. Although not required, the codec 224 can further provide a level of confidence regarding the binary output. The level of confidence can be determined by any suitable statistical techniques, including those set forth previously. Next, in step 524, the codec encodes packet structure_(j). As will be appreciated, the comparison steps 508 and 516 and encoding step 524 can be conducted in any order, including in parallel. The counter is incremented in step 528, and in step 532, the codec determines whether or not there is a next packet structure_(j).
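One possible way to carry out the comparison of steps 508 and 516 is sketched below. The cosine similarity of magnitude spectra is an assumed measure chosen for illustration; the text leaves the comparison technique open. The function names, the 8 kHz sampling rate, and the 160-sample packet structure are likewise assumptions.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Magnitudes of the non-negative frequency bins of a naive DFT."""
    n = len(samples)
    return [
        abs(sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectral_similarity(a, b):
    """Cosine similarity of two magnitude spectra, between 0 and 1."""
    sa, sb = magnitude_spectrum(a), magnitude_spectrum(b)
    norm = math.sqrt(sum(x * x for x in sa)) * math.sqrt(sum(y * y for y in sb))
    if norm == 0.0:
        # At least one packet structure is all zeros; similar only if both are.
        return 1.0 if sa == sb else 0.0
    return sum(x * y for x, y in zip(sa, sb)) / norm


def is_similar(a, b, threshold_x=0.9):
    """Binary output of the comparison for an assumed similarity threshold X."""
    return spectral_similarity(a, b) >= threshold_x


def tone(freq_hz, n=160, rate=8000, phase=0.0):
    """Test signal: n samples of a sine wave at the given frequency."""
    return [math.sin(2 * math.pi * freq_hz * t / rate + phase) for t in range(n)]


print(is_similar(tone(200), tone(200, phase=1.0)))  # same spectrum, shifted phase
print(is_similar(tone(200), tone(1000)))            # different spectral content
```

Comparing magnitude spectra rather than raw waveforms makes the measure insensitive to phase, so a sustained vowel that spans two packet structures is reported as similar even though the sample values differ.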

[0056] The operation of the acoustic prioritization agent 232 will now be discussed with reference to FIGS. 2 and 6.

[0057] In step 600, the acoustic prioritizing agent 232 gets packet_(j) (which corresponds to packet structure_(j)). In step 604, the agent 232 determines whether VAD 220 categorized packet structure_(j) as “silence”. When the corresponding packet structure_(j) has been categorized as “silence”, the agent 232, in step 608, processes packet_(j) based on the level of confidence reported by the VAD 220 for packet structure_(j).

[0058] The processing of “silent” packets can take differing forms. In one configuration, a packet having a corresponding level of confidence less than a selected silence threshold Y is dropped. In other words, the agent requests the packet protocol interface 228 to prevent packet_(j) from being transported across the network. A “silence” packet having a corresponding level of confidence more than the selected threshold is sent. The priority of the packet can be set at a lower level than the priorities of “nonsilence” packets. “Priority” can take many forms depending on the particular protocols and network topology in use. For example, priority can refer to a service class or type (for protocols such as Differentiated Services and Internet Integrated Services), and priority level (for protocols such as Ethernet). For example, “silent” packets can be sent via the assured forwarding class while “nonsilence” packets are sent via the expedited forwarding (code point) class. This can be done, for example, by suitably marking the Type of Service or Traffic Class fields, as appropriate. In yet another configuration, a value marker indicative of the importance of the packet to voice quality is placed in the header and/or payload of the packet. The value marker can be used by intermediate nodes, such as routers, and/or by the buffer manager 332 (FIG. 3) to discard packets in appropriate applications. For example, when traffic congestion is found to exist using any of the parameters set forth above, packets having value markers less than a predetermined level can be dropped during transit or after reception. This configuration is discussed in detail with reference to FIGS. 7 and 8. Multiple “silence” packet thresholds can be employed for differing types of packet processing, depending on the application. As will be appreciated, the various thresholds can vary dynamically depending on the degree of network congestion as set forth previously.
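The Differentiated Services marking mentioned above can be illustrated with a brief sketch. The code point values for expedited forwarding (46) and for assured forwarding class AF11 (10) are the standard ones; the choice of AF11 among the assured forwarding classes, the function names, and the use of a socket option to apply the marking are assumptions made for illustration.

```python
import socket

DSCP_EF = 46     # expedited forwarding code point
DSCP_AF11 = 10   # assured forwarding, class 1, low drop precedence (assumed choice)


def dscp_for(category: str) -> int:
    """Code point for a packet, given the VAD category of its packet structure."""
    return DSCP_AF11 if category == "silence" else DSCP_EF


def tos_byte(dscp: int) -> int:
    """The code point occupies the upper six bits of the Type of Service octet."""
    return dscp << 2


def mark_socket(sock: socket.socket, category: str) -> None:
    """Apply the marking to an IPv4 socket (IP_TOS availability is platform dependent)."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos_byte(dscp_for(category)))


print(tos_byte(dscp_for("silence")), tos_byte(dscp_for("nonsilence")))
```

Whether intermediate routers honor these markings depends on how the network is configured.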

[0059] When the corresponding packet structure_(j) has been categorized as “nonsilence”, the agent 232, in step 618, determines whether the degree of similarity between the corresponding packet structure_(j) and packet structure_(j−1) (as determined by the codec 224) is greater than or equal to a selected similarity threshold X. If so, the agent 232 proceeds to step 628 (discussed below). If not, the agent 232 proceeds to step 624. In step 624, the agent determines whether the degree of similarity between the corresponding packet structure_(j) and packet structure_(j+1) (as determined by the codec 224) is greater than or equal to the selected similarity threshold X. If so, the agent 232 proceeds to step 628.

[0060] In step 628, the agent 232 processes packet_(j) based on the magnitude of the degree of similarity and/or on the treatment of the temporally adjacent packet_(j−1). As in the case of “silent” packets, the processing of similar packets can take differing forms. In one configuration, a packet having a degree of similarity more than the selected similarity threshold X is dropped. In other words, the agent requests the packet protocol interface 228 to prevent packet_(j) from being transported across the network. The packet loss concealment agent 328 (FIG. 3) in the second communication device 300 will reconstruct the dropped packet. In that event, the magnitude of X is determined by the packet reconstruction efficiency and accuracy of the packet loss concealment algorithm. If the preceding packet_(j−1) was dropped, packet_(j) may be forwarded, as the dropping of too many consecutive packets can have a detrimental impact on the efficiency and accuracy of the packet loss concealment agent 328. In another configuration, multiple transmission priorities are used depending on the degree of similarity. For example, a packet having a degree of similarity more than the selected threshold is sent with a lower priority, that is, a priority set at a lower level than the priorities of dissimilar packets. As noted above, “priority” can take many forms depending on the particular protocols and network topology in use. In yet another configuration, the value marker indicative of the importance of the packet to voice quality is placed in the header and/or payload of the packet. The value marker can be used as set forth previously and below to cause the dropping of packets having value markers below one or more selected marker value thresholds. Multiple priority levels can be employed for multiple similarity thresholds, depending on the application. As will be appreciated, the various similarity and marker value thresholds can vary dynamically depending on the degree of network congestion as set forth previously.
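As an illustration only, the use of multiple priority levels for multiple similarity thresholds might be sketched as follows. The function name, the numeric thresholds, and the convention that a larger number means a higher transmission priority are assumptions made for the example and are not taken from the disclosure.

```python
from bisect import bisect_right
from typing import Sequence


def transmission_priority(similarity: float,
                          thresholds: Sequence[float],
                          highest_priority: int) -> int:
    """Map a degree of similarity to a priority level (hypothetical helper).

    `thresholds` must be sorted in ascending order. Each threshold that the
    similarity meets or exceeds lowers the priority by one level, so the
    most similar packets receive the lowest priority.
    """
    crossed = bisect_right(thresholds, similarity)  # thresholds <= similarity
    return max(highest_priority - crossed, 0)


# Assumed example values: two similarity thresholds, three priority levels.
levels = [transmission_priority(s, [0.7, 0.9], 2) for s in (0.5, 0.8, 0.95)]
```

In a congested network the threshold list could be replaced at run time, which corresponds to the dynamically varying thresholds mentioned above.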

[0061] After steps 608 and 628, and when step 624 finds that the similarity between the corresponding packet structure_(j) and packet structure_(j+1) (as determined by the codec 224) is less than the selected similarity threshold X, the agent 232 proceeds to step 612. In step 612, the counter j is incremented by one. In step 616, the agent 232 determines whether there is a next packet_(j). When there is a next packet_(j), the agent 232 proceeds to and repeats step 600. When there is no next packet_(j), the agent 232 proceeds to step 632 and terminates operation until more packet structures are received for packetization.
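By way of a non-limiting sketch, the decision path of steps 618, 624 and 628 for “nonsilence” packet structures, in the configuration where similar packets are dropped, might look like the following. The function name, the use of None for a missing neighbor, and the string labels are assumptions for illustration; the similarity values are taken as already supplied by the codec.

```python
from typing import List, Optional, Tuple


def decide_nonsilent(similarities: List[Tuple[Optional[float], Optional[float]]],
                     threshold_x: float) -> List[str]:
    """Return "drop" or "send" for each nonsilent packet structure.

    Each entry is (similarity to structure j-1, similarity to structure j+1);
    None means there is no such neighbor (an assumption of this sketch).
    """
    decisions = []
    prev_dropped = False
    for sim_prev, sim_next in similarities:
        # Step 618: compare against the preceding structure.
        similar = sim_prev is not None and sim_prev >= threshold_x
        # Step 624: otherwise compare against the following structure.
        if not similar:
            similar = sim_next is not None and sim_next >= threshold_x
        # Step 628: drop a similar packet unless its predecessor was dropped,
        # so that consecutive losses do not degrade loss concealment.
        if similar and not prev_dropped:
            decisions.append("drop")
            prev_dropped = True
        else:
            decisions.append("send")
            prev_dropped = False
    return decisions


result = decide_nonsilent(
    [(None, 0.9), (0.9, 0.95), (0.95, 0.2), (0.2, 0.1)], threshold_x=0.8)
```

The sketch forwards every packet that follows a dropped packet; the description only states that such a packet may be forwarded, so other policies are equally possible.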

[0062] The operation of the buffer manager 332 will now be described with reference to FIGS. 3 and 7-8. In step 800, the buffer manager 332 determines whether the buffer delay (or length) is greater than or equal to a buffer threshold Y. If not, the buffer manager 332 repeats step 800. If so, the buffer manager 332 in step 804 gets packet_(k) from the receive buffer 336. Initially, of course, the counter k is set to 1 to denote the packet in the first position in the receive buffer (or at the head of the buffer). Alternatively, the manager 332 can retrieve the last packet in the receive buffer (or at the tail of the buffer).

[0063] In step 808, the manager 332 determines if the packet is expendable; that is, whether the value of the value marker is less than (or greater than, depending on the configuration) a selected value threshold. When the value of the value marker is less than the selected value threshold, the packet_(k) in step 812 is discarded or removed from the buffer and in step 816 the surrounding enqueued packets are time compressed around the slot previously occupied by packet_(k).

[0064] Time compression is demonstrated with reference to FIG. 7. The buffer 336 is shown as having various packets 700a-e, each packet payload representing a corresponding time interval of the voice stream. If the manager determines that packet 700b (which corresponds to the time interval t₂ to t₃) is expendable, the manager 332 first removes the packet 700b from the queue 336a and then moves packets 700c-e ahead in the queue. To perform time compression, the packet counters for packets 700c-e are decremented such that packet 700c now occupies the time slot t₂ to t₃, packet 700d time slot t₃ to t₄, and packet 700e time slot t₄ to t₅. In this manner, the packet loss concealment agent 328 will be unaware that packet 700b has been discarded and will not attempt to reconstruct the packet. In contrast, if a packet is omitted from an ordering of packets, the packet loss concealment agent 328 will recognize the omission by the break in the packet counter sequence. The agent 328 will then attempt to reconstruct the packet.
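The renumbering described for FIG. 7 can be illustrated with a short sketch. The representation of the queue as a list of (packet counter, label) pairs and the function name are assumptions for the example only.

```python
from typing import List, Tuple


def time_compress(queue: List[Tuple[int, str]],
                  expendable_counter: int) -> List[Tuple[int, str]]:
    """Remove one packet and decrement the counters of all later packets.

    The result has no gap in its counter sequence, so a loss concealment
    stage inspecting the counters would see nothing to reconstruct.
    """
    compressed = []
    removed = False
    for counter, label in queue:
        if not removed and counter == expendable_counter:
            removed = True          # discard the expendable packet
            continue
        # Packets after the removed one move one slot earlier.
        compressed.append((counter - 1 if removed else counter, label))
    return compressed


queue = [(1, "700a"), (2, "700b"), (3, "700c"), (4, "700d"), (5, "700e")]
after = time_compress(queue, expendable_counter=2)
```

Here packet 700b is removed and packets 700c-e take counters 2, 3 and 4, matching the slot reassignment described above.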

[0065] Returning again to FIG. 8, the manager 332 in step 820 increments the counter k and repeats step 800 for the next packet.
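A minimal sketch of the loop of FIG. 8 (steps 800 to 820) follows, assuming that buffer length is measured in packets, that a lower marker value means a more expendable packet, and that the manager stops rather than waits once the buffer falls below threshold Y. The dictionary keys and function name are hypothetical.

```python
from typing import Dict, List


def manage_buffer(buffer: List[Dict[str, int]],
                  threshold_y: int,
                  value_threshold: int) -> List[Dict[str, int]]:
    """Discard expendable packets while the buffer holds at least Y packets."""
    buf = [dict(p) for p in buffer]      # work on a copy
    k = 0                                # head of the buffer
    # Step 800: act only while the buffer length meets threshold Y.
    while len(buf) >= threshold_y and k < len(buf):
        # Steps 804 and 808: get packet k and test its value marker.
        if buf[k]["value"] < value_threshold:
            del buf[k]                   # step 812: discard
            for later in buf[k:]:        # step 816: time compression
                later["counter"] -= 1
            # The next packet now sits at index k, so k is not advanced.
        else:
            k += 1                       # step 820: move to the next packet
    return buf


received = [{"counter": c, "value": v}
            for c, v in zip(range(1, 6), [5, 1, 1, 5, 1])]
trimmed = manage_buffer(received, threshold_y=4, value_threshold=3)
```

With these assumed values two low-value packets are removed, after which the buffer is shorter than Y and the remaining low-value packet is kept.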

[0066] A number of variations and modifications of the invention can be used. It would be possible to provide for some features of the invention without providing others.

[0067] For example, in one alternative embodiment, the prioritizing agent's priority assignment based on the type of “silence” detected can be performed by the VAD 220.

[0068] In another alternative embodiment, though FIG. 2 is suitable for use with a VoIP architecture using Embedded Communication Objects interworking with a telephone system and packet network, it is to be understood that the configuration of the VAD 220, codec 224, prioritizing agent 232 and/or buffer manager 332 of the present invention can vary significantly depending upon the application and the protocols employed. For example, the prioritizing agent 232 can be included in an alternate location in the embodiment of FIG. 2, and the buffer manager in an alternate location in the embodiment of FIG. 3. The prioritizing agent and/or buffer manager can interface with different components than those shown in FIG. 2 for other types of user interfaces, such as a PC, wireless telephone, and laptop. The prioritizing agent and/or buffer manager can be included in an intermediate node between communication devices, such as in a switch, transcoding device, translating device, router, gateway, etc.

[0069] In another embodiment, the packet comparison operation of the codec is performed by another component. For example, the VAD and/or acoustic prioritization agent performs these functions.

[0070] In another embodiment, the level of confidence determination of the VAD is performed by another component. For example, the codec and/or acoustic prioritization agent performs these functions.

[0071] In yet a further embodiment, the codec and/or VAD, during packet structure processing, attempt to identify acoustic events of great importance, such as plosives. When such acoustic events are identified (e.g., when the difference identified by the codec exceeds a predetermined threshold), the acoustic prioritizing agent 232 can cause the packets corresponding to the packet structures to have extremely high priorities and/or be marked with value markers indicating that the packet is not to be dropped under any circumstances. A lost packet containing such important acoustic events often cannot be reconstructed accurately by the packet loss concealment agent 328.

[0072] In yet a further embodiment, the analyses performed by the codec, VAD, and acoustic prioritizing agent are performed on a frame level rather than a packet level. “Silent” frames and/or acoustically similar frames are omitted from the packet payloads. The procedural mechanisms for these embodiments are similar to that for packets in FIGS. 4 and 5. In fact, the replacement of “frame” for “packet structure” and “packet” in FIGS. 4 and 5 provides a configuration of this embodiment.

[0073] In yet another embodiment, the algorithms of FIGS. 6 and 8 are state driven. In other words, the algorithms are not triggered until network congestion exceeds a predetermined amount. The trigger for the state to be entered can be based on any of the performance parameters set forth above increasing above or decreasing below predetermined thresholds.
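A state-driven trigger of this kind might be sketched as follows. The metric names, the limit values, and the restriction to parameters that increase above a limit are assumptions for the example; parameters that trigger by decreasing below a threshold would need the comparison reversed.

```python
from typing import Callable, Dict, List


def congestion_triggered(metrics: Dict[str, float],
                         upper_limits: Dict[str, float]) -> bool:
    """True when any monitored parameter exceeds its predetermined limit."""
    return any(metrics.get(name, 0.0) > limit
               for name, limit in upper_limits.items())


def process(packets: List[int],
            metrics: Dict[str, float],
            upper_limits: Dict[str, float],
            is_expendable: Callable[[int], bool]) -> List[int]:
    """Apply the dropping rule only while the congested state is entered."""
    if not congestion_triggered(metrics, upper_limits):
        return list(packets)             # algorithm not triggered
    return [p for p in packets if not is_expendable(p)]


limits = {"jitter_ms": 30.0, "latency_ms": 150.0}      # assumed limits
values = [1, 5, 2]                                      # assumed marker values
calm = process(values, {"jitter_ms": 10.0, "latency_ms": 100.0},
               limits, lambda v: v < 3)
busy = process(values, {"jitter_ms": 45.0, "latency_ms": 100.0},
               limits, lambda v: v < 3)
```

In the calm case all packets pass through unchanged; in the congested case only packets whose assumed marker value is at least 3 remain.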

[0074] In yet a further embodiment, the dropping of packets based on the value of the value marker is performed by an intermediate node, such as a router. This embodiment is particularly useful in a network employing any of Multi-Protocol Label Switching (MPLS), ATM, Integrated Services Controlled Load, and Differentiated Services.

[0075] In yet a further embodiment, the positions of the codec and adaptive playout unit in FIG. 3 are reversed. Thus, the receive buffer 336 contains encoded packets rather than decoded packets.

[0076] In yet a further embodiment, the acoustic prioritization agent 232 processes packet structures before and/or after encryption.

[0077] In yet a further embodiment, a value marker is not employed and the buffer manager itself performs the packet/frame comparison to identify acoustically similar packets that can be expended in the event that buffer length/delay reaches undesired levels.

[0078] In other embodiments, the VAD 220, codec 224, acoustic prioritization agent 232, and/or buffer manager 332 are implemented as software and/or hardware, such as a logic circuit, e.g., an Application Specific Integrated Circuit or ASIC.

[0079] The present invention, in various embodiments, includes components, methods, processes, systems and/or apparatus substantially as depicted and described herein, including various embodiments, subcombinations, and subsets thereof. Those of skill in the art will understand how to make and use the present invention after understanding the present disclosure. The present invention, in various embodiments, includes providing devices and processes in the absence of items not depicted and/or described herein or in various embodiments hereof, including in the absence of such items as may have been used in previous devices or processes, e.g., for improving performance, achieving ease and/or reducing cost of implementation.

[0080] The foregoing discussion of the invention has been presented for purposes of illustration and description. The foregoing is not intended to limit the invention to the form or forms disclosed herein. Although the description of the invention has included description of one or more embodiments and certain variations and modifications, other variations and modifications are within the scope of the invention, e.g., as may be within the skill and knowledge of those in the art, after understanding the present disclosure. It is intended to obtain rights which include alternative embodiments to the extent permitted, including alternate, interchangeable and/or equivalent structures, functions, ranges or steps to those claimed, whether or not such alternate, interchangeable and/or equivalent structures, functions, ranges or steps are disclosed herein, and without intending to publicly dedicate any patentable subject matter.

What is claimed is:
 1. A method for processing voice communications over a data network, comprising: (a) receiving a voice stream from a user, the voice stream comprising a plurality of temporally distinct segments; (b) processing at least one selected first segment of the voice stream, wherein the processing step comprises at least one of the following substeps: (i) determining whether or not the contents of the selected first segment are the product of voice activity and, when the contents are determined not to be the product of voice activity, a level of confidence that the voice activity determination is accurate; (ii) determining a type of voice activity associated with the contents of the first segment; and (iii) comparing the first segment with a second segment of the voice stream to determine a degree of acoustic similarity between the first and second segments, wherein the processing of the first segment is based on at least one of the level of confidence, the type of voice activity, and the degree of acoustic similarity.
 2. The method of claim 1, further comprising: (c) based on the at least one of the level of confidence, type of voice activity and the degree of acoustic similarity, assigning an importance to the first segment.
 3. The method of claim 2, wherein the importance is a value marker and further comprising: incorporating the value marker into a first packet comprising the first segment.
 4. The method of claim 2, wherein the importance is a service class assigned to a first packet comprising the first segment.
 5. The method of claim 2, wherein the importance is a transmission priority assigned to a first packet comprising the first segment.
 6. The method of claim 1, wherein in the processing step a first packet comprising the first segment is not transmitted when the at least one of the level of confidence and the degree of acoustic similarity is one of less than and greater than a predetermined threshold.
 7. The method of claim 6, further comprising: varying the predetermined threshold based on at least one of jitter, latency, a number of missing packets, a number of packets received out-of-order, a processing delay, a propagation delay, a receive buffer delay, and a number of packets enqueued in a receive buffer.
 8. The method of claim 1, wherein the processing step comprises substep (i).
 9. The method of claim 1, wherein the processing step comprises substep (iii).
 10. The method of claim 9, wherein the second segment temporally precedes the first segment and a third segment temporally follows the first segment and wherein substep (ii) comprises: comparing the first segment with the second segment of the voice stream to determine a first degree of acoustic similarity between the first and second segments; and comparing the first segment with the third segment of the voice stream to determine a second degree of acoustic similarity between the first and third segments.
 11. The method of claim 10, wherein the processing step is based on at least one of the first and second degrees of acoustic similarity one of exceeding or being less than a selected similarity threshold.
 12. The method of claim 1, wherein the first segment corresponds to a payload of a first packet.
 13. The method of claim 1, wherein the first segment corresponds to a frame of a first packet.
 14. The method of claim 1, wherein different classes of services are used for different segments of the voice stream.
 15. The method of claim 1, wherein different transmission priorities are used for different segments of the voice stream.
 16. The method of claim 1, wherein the processing step comprises substep (ii).
 17. The method of claim 16, wherein the type of voice activity is a plosive.
 18. The method of claim 9, wherein a first packet associated with the first segment is not transmitted and further comprising: later reconstructing the first segment with a packet loss concealment algorithm.
 19. The method of claim 3, further comprising: when the value of the value marker is one of less than and greater than a predetermined value threshold, removing the first packet from a receive buffer.
 20. A computer readable medium containing instructions to perform the steps of claim 1.
 21. A logic circuit configured to perform the steps of claim 1.
 22. A method for managing a receive buffer, comprising: providing a receive buffer, the receive buffer containing a plurality of packets associated with voice communications; and based on a level of importance associated with at least some of the plurality of packets, removing at least some of the packets from the receive buffer while leaving other packets in the receive buffer.
 23. The method of claim 22, wherein the level of importance associated with each packet is indicated by a corresponding value marker.
 24. The method of claim 22, further comprising: determining when at least one of a delay associated with the receive buffer and a length of the receive buffer exceeds a predetermined level; when the at least one of a delay and length exceeds the predetermined level, performing the removing step; and when the at least one of a delay and length does not exceed the predetermined level, not performing the removing step.
 25. The method of claim 22, further comprising: for at least some packets remaining in the receive buffer, resetting a packet counter indicating an ordering of the packets.
 26. The method of claim 22, further comprising: assigning a packet counter of a removed packet to a packet remaining in the receive buffer.
 27. The method of claim 22, further comprising: performing time compression around at least one of the removed packets.
 28. The method of claim 22 further comprising before the removing step: receiving a voice stream from a user, the voice stream comprising a plurality of temporally distinct segments; processing at least one selected first segment of the voice stream, wherein the processing step comprises at least one of the following substeps: determining whether or not the contents of the selected first segment are the product of voice activity and, when the contents are determined to be a product of voice activity, a level of confidence that the voice activity determination is accurate; determining a type of voice activity associated with the contents of the first segment; and comparing the first segment with a second segment of the voice stream to determine a degree of acoustic similarity between the first and second segments, wherein the processing of the first segment is based on at least one of the level of confidence, the type of voice activity, and the degree of acoustic similarity.
 29. The method of claim 28, further comprising: based on the at least one of the level of confidence, the type of voice activity, and the degree of acoustic similarity, assigning an importance to the first segment.
 30. The method of claim 29, wherein the importance is a value marker and further comprising: incorporating the value marker into a first packet comprising the first segment.
 31. The method of claim 29, wherein the importance is a service class assigned to a first packet comprising the first segment.
 32. The method of claim 29, wherein the importance is a transmission priority assigned to a first packet comprising the first segment.
 33. A computer readable medium to perform the steps of claim 22.
 34. A logic circuit configured to perform the steps of claim 22.
 35. A system for transmitting voice communications over a data network, comprising: (a) an input operable to receive a voice stream from a user, the voice stream comprising a plurality of temporally distinct segments; (b) a packet protocol interface operable to convert at least one selected first segment of the voice stream into at least a first packet; and (c) an acoustic prioritization agent operable to control processing of at least one of the first segment and the at least a first packet based on at least one of a level of confidence that the contents of the selected first segment are not the product of voice activity, a type of voice activity associated with the contents of the first segment, and a degree of acoustic similarity between the first segment and a second segment of the voice stream.
 36. The system of claim 35, wherein the acoustic prioritization agent is operable to assign an importance to the at least one of the first segment and the at least a first packet based on the at least one of the level of confidence, type of voice activity and the degree of acoustic similarity.
 37. The system of claim 36, wherein the importance is a value marker and the packet protocol interface is operable to incorporate the value marker into the at least a first packet.
 38. The system of claim 36, wherein the importance is a service class assigned to the at least a first packet.
 39. The system of claim 36, wherein the importance is a transmission priority assigned to the at least a first packet.
 40. The system of claim 35, wherein the packet protocol interface is operable to not transmit the at least a first packet when the at least one of the level of confidence and the degree of acoustic similarity is one of less than and greater than a predetermined threshold.
 41. The system of claim 40, wherein the predetermined threshold is varied based on at least one of jitter, latency, a number of missing packets, a number of packets received out-of-order, a processing delay, a propagation delay, and a receive buffer length.
 42. The system of claim 35, wherein the at least one of the level of confidence, type of voice activity and the degree of acoustic similarity is the level of confidence.
 43. The system of claim 35, wherein the at least one of the level of confidence, type of voice activity and the degree of acoustic similarity is the degree of acoustic similarity.
 44. The system of claim 43, wherein the second segment temporally precedes the first segment and a third segment of the voice stream temporally follows the first segment and further comprising: a codec operable to compare the first segment with the second segment of the voice stream to determine a first degree of acoustic similarity between the first and second segments and compare the first segment with the third segment of the voice stream to determine a second degree of acoustic similarity between the first and third segments.
 45. The system of claim 44, wherein the prioritization agent controls processing of the at least one of the first segment and the at least a first packet based on at least one of the first and second degrees of acoustic similarity one of exceeding or being less than a selected similarity threshold.
 46. The system of claim 35, wherein the first segment corresponds to a payload of the at least a first packet.
 47. The system of claim 35, wherein the first segment corresponds to a frame of the at least a first packet.
 48. The system of claim 35, wherein different classes of services are used for different segments of the voice stream.
 49. The system of claim 35, wherein the packet protocol interface is operable to use different transmission priorities for different segments of the voice stream.
 50. The system of claim 35, wherein the at least one of the level of confidence, type of voice activity and the degree of acoustic similarity is the type of voice activity.
 51. The system of claim 50, wherein the type of voice activity is a plosive.
 52. The system of claim 43, wherein the at least a packet is not transmitted and further comprising: a packet loss concealment agent operable to later reconstruct the first segment.
 53. The system of claim 37, further comprising: a buffer manager operable to remove the at least a first packet from a receive buffer when the value of the value marker is one of less than and greater than a predetermined value threshold.
 54. A system for managing a receive buffer, comprising: a receive buffer containing a plurality of packets associated with voice communications; and a buffer manager operable to remove at least some of the packets from the receive buffer while leaving other packets in the receive buffer based on a level of importance associated with at least some of the plurality of packets.
 55. The system of claim 54, wherein the level of importance of the at least some packets is indicated by a corresponding value marker.
 56. The system of claim 54, wherein the buffer manager is further operable to: determine if at least one of a delay associated with the receive buffer and a number of packets enqueued in the receive buffer exceeds a predetermined level; when the at least one of a delay and number of packets exceeds the predetermined level, performing the removing step; and when the at least one of a delay and number of packets does not exceed the predetermined level, not performing the removing step.
 57. The system of claim 54, wherein the buffer manager is operable, for at least some packets remaining in the receive buffer, to reset a packet counter indicating an ordering of the packets.
 58. The system of claim 54, wherein the buffer manager is operable to assign a packet counter of a removed packet to a packet remaining in the receive buffer.
 59. The system of claim 54, wherein the buffer manager is operable to perform time compression around at least one of the removed packets.
 60. A packet, comprising: a packet header comprising transmission information and a payload comprising one or more frames of a voice stream, wherein at least one of the packet header and payload comprises a value marker, wherein a value of the value marker is indicative of a level of importance of the payload to maintaining a selected quality of voice communication.
 61. The packet of claim 60, wherein the transmission information comprises the value marker.
 62. The packet of claim 60, wherein the payload comprises the value marker.
 63. A method for processing voice communications over a data network, comprising: (a) receiving a first voice stream from a first user, the voice stream comprising a plurality of temporally distinct segments associated with a plurality of packets and the voice stream being a part of a session between at least the first user and a second user, wherein the session has an associated at least one of a jitter value, a latency value, a number of missing packets, a number of packets received out-of-order, a processing delay, a propagation delay, a receive buffer delay, and a number of packets enqueued in a receive buffer; and (b) comparing the at least one of a jitter value, a latency value, a number of missing packets, a number of packets received out-of-order, a processing delay, a propagation delay, a receive buffer delay, and a number of packets enqueued in a receive buffer with a predetermined threshold; (i) when the at least one of a jitter value, a latency value, a number of missing packets, a number of packets received out-of-order, a processing delay, a propagation delay, a receive buffer delay, and a number of packets enqueued in a receive buffer exceeds the predetermined threshold, not transmitting at least some of the plurality of packets; and (ii) when the at least one of a jitter value, a latency value, a number of missing packets, a number of packets received out-of-order, a processing delay, a propagation delay, a receive buffer delay, and a number of packets enqueued in a receive buffer is less than the predetermined threshold, transmitting the at least some of the plurality of packets.
 64. The method of claim 63, further comprising at least one of the following steps: (c) determining whether or not the contents of a selected first segment of the plurality of temporally distinct segments are the product of voice activity and, when the contents are determined not to be the product of voice activity, a level of confidence that the voice activity determination is accurate; (d) determining a type of voice activity associated with the contents of the first segment; and (e) comparing the first segment with a second segment of the plurality of temporally distinct segments to determine a degree of acoustic similarity between the first and second segments, wherein the processing of the first segment is based on at least one of the level of confidence, the type of voice activity, and the degree of acoustic similarity.
 65. The method of claim 64, further comprising: (c) based on the at least one of the level of confidence, type of voice activity and the degree of acoustic similarity, assigning an importance to the first segment.
 66. The method of claim 64, wherein the importance is a value marker and further comprising: incorporating the value marker into a first packet comprising the first segment.
 67. The method of claim 64, wherein the importance is a service class assigned to a first packet comprising the first segment.
 68. The method of claim 64, wherein the importance is a transmission priority assigned to a first packet comprising the first segment.
 69. The method of claim 63, wherein in the processing step a first packet comprising the first segment is not transmitted when the at least one of the level of confidence and the degree of acoustic similarity is one of less than and greater than a predetermined threshold.
 70. The method of claim 64, wherein step (c) is performed.
 71. The method of claim 64, wherein step (d) is performed.
 72. The method of claim 64, wherein step (d) is performed.
 73. The method of claim 64, wherein the type of voice activity is a plosive.
 74. The method of claim 69, wherein a first packet associated with the first segment is not transmitted and further comprising: (f) later reconstructing the first segment with a packet loss concealment algorithm.